Component Failure Prediction

Data Processing File

Task

  1. Predict unscheduled removals at component level, per serial number (covered in another file)
  2. Show the different stages of the ML pipeline (EDA, feature selection, data manipulation)
  3. Bring the ML pipeline into production (TBU)
  4. Provide API endpoints (TBU)
  5. Provide a requirements.txt file

Index

ML Pipeline

1. Libraries

Go to Index

2. Load Data


2.1 Data review

3. Data Cleaning


ASSUMPTION/RESEARCH FACT:

Missing data affects predictive modeling. In theory, a model can still perform well if less than 25-30% of the data is missing; beyond that, it may give wrong predictions (although performance still depends on the data the model is trained on). Therefore, only features with more than 70% of their entries populated are considered.
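
The 70% rule above can be sketched roughly as follows. This is a minimal illustration, not the notebook's original code: the helper name keep_well_populated, the min_fill parameter, the SPARSE_COL column and the toy values are all assumptions made for the example.

```python
import pandas as pd


def keep_well_populated(df: pd.DataFrame, min_fill: float = 0.70) -> pd.DataFrame:
    """Keep only columns whose share of non-null entries exceeds min_fill."""
    fill_ratio = df.notna().mean()  # fraction of non-null values per column
    return df.loc[:, fill_ratio > min_fill]


# Toy data (hypothetical values, for illustration only)
sample = pd.DataFrame({
    "MEL_QTY": [1.0, None, 3.0, 4.0],        # 75% populated, kept
    "SPARSE_COL": [None, None, None, 2.0],   # 25% populated, dropped
})
filtered = keep_well_populated(sample)
print(list(filtered.columns))
```

The threshold is a parameter so that the 70% cut-off can be revisited if the data changes.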

Cleaning Notes: The data is a mix of textual and numerical data and is mostly clean.

After initial checks, the only issue found in the current data is empty values. The data is consistent, has no impossible values and has almost no outliers (based on EDA and min/max analysis done in Microsoft Excel).

3.1 Null Checks

4. Feature Analysis


5. Feature Selection


Reasons for the irrelevance of the following columns

6. Exploratory Data Analysis


Types of analysis done:

  1. Columnar data analysis
  2. Missing value check
  3. Correlation analysis
  4. Analysis with Target variable
  5. Outlier detection
  6. Data Distribution analysis
  7. EDA Inference

The inference from the EDA is provided at the end of section 6 (6.7), after the plots.

6.1 Columnar analysis

6.2 Missing Value check


6.3 Correlation among data
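
A correlation check of this kind could be sketched as below. This is a hedged illustration only: the helper name numeric_correlation and the toy values are assumptions, and Pearson correlation (the pandas default) is assumed rather than confirmed by the notebook.

```python
import pandas as pd


def numeric_correlation(df: pd.DataFrame) -> pd.DataFrame:
    """Pairwise Pearson correlation between the numeric columns only."""
    return df.select_dtypes(include="number").corr()


# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "AIRCRAFT_AGE": [1.0, 2.0, 3.0, 4.0],
    "MEL_QTY": [2.0, 4.0, 6.0, 8.0],
    "OPERATOR_CD": ["A", "B", "A", "B"],  # textual column, ignored by the helper
})
corr = numeric_correlation(toy)
print(corr)
```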


6.4 Feature Plots with Target Variable


6.5 Outlier check for Time features vs target


6.6 Other textual categorical variable distribution


6.7 EDA Inference



  1. From (6.1) and the feature analysis, FLEET_CD holds unique numerical values indicative of aircraft type, so it can be input as dummy columns (similar to one-hot encoding).
  2. Feature MEL_QTY (6.2) has many null values, which need to be imputed (mean substitution etc.).
  3. There is no notable correlation among the features, so it is safe to train the model (6.3).
  4. AIRCRAFT_AGE has consistent values for both values of the target variable and has no extremely high values (6.4).
  5. From (6.5), TIME_SINCE_INSTALL_CYCLES and TIME_SINCE_NEW_CYCLES seem to have high variation, which can skew the results. They need to be handled according to the business rules and the relevance of each column for prediction.
  6. Outlier check (6.5): for TIME_SINCE_INSTALL_CYCLES, many data points lie beyond the maximum point (Q3 + 1.5 x IQR) of the box plot, indicating high variation. For TIME_SINCE_NEW_CYCLES, the majority of the data points lie outside the IQR range, so the feature is dropped.
  7. Analysis of REPETITIVE_FAULT_QT (6.6) highlights that the majority of the data points are valued 0.
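
The box-plot rule referred to in point 6 can be sketched as follows. This is a minimal illustration under stated assumptions: the helper name iqr_outlier_share, the k parameter and the toy values are hypothetical, and the lower fence (Q1 - 1.5 x IQR) is included by symmetry even though the text only mentions the upper one.

```python
import pandas as pd


def iqr_outlier_share(values: pd.Series, k: float = 1.5) -> float:
    """Share of points outside the box-plot fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    outside = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return float(outside.mean())


# Toy data (hypothetical values, for illustration only)
cycles = pd.Series(
    [10, 12, 11, 13, 12, 11, 10, 500],
    name="TIME_SINCE_INSTALL_CYCLES",
)
share = iqr_outlier_share(cycles)
print(f"{share:.3f} of the points lie beyond the fences")
```

A high share for a column is a signal to review it against the business rules before deciding whether to cap, transform or drop it.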

7. Data Processing


8. Feature Engineering


The dataset contains numerical (float/int) and object-type data (textual, date). Feature engineering converts the data to a uniform numerical format so that the model can learn and predict better. Combined-type data can also work well with RNNs and other deep neural networks.

Type Conversions

  1. Imputations Handling
    • MEL_QTY (Handling null values of the feature by imputing the mean)
  2. Date-Time Conversion
    • INSTALL_DT (Streamlining the format of the feature to extract date information)
    • REMOVAL_DT (Streamlining the format of the feature to extract date information)
    • COMPONENT_LIFE_DAYS (Creating a component-life-in-days feature from the two features above for model training)
    • AIRCRAFT_AGE (Streamlining the format of aircraft age with component life)
  3. Categorical Data

    Conversion from text to numerical values using dummy creation, similar to one-hot encoding

    • OPERATOR_CD
    • FLEET_CD

8.1 Imputations


Handling missing/empty values in the data

  1. Imputing numerical values with the mean (8.1.1)
  2. Imputing categorical values with the mode, i.e. the most frequent term (8.1.2)

8.1.1 Numerical Imputations
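
Mean imputation for a numerical feature such as MEL_QTY could look roughly like this. The helper name impute_mean and the toy values are assumptions made for illustration; the notebook's own code may differ.

```python
import pandas as pd


def impute_mean(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return a copy of df with nulls in `column` replaced by the column mean."""
    out = df.copy()  # leave the caller's frame untouched
    out[column] = out[column].fillna(out[column].mean())
    return out


# Toy data (hypothetical values, for illustration only)
toy_numeric = pd.DataFrame({"MEL_QTY": [1.0, None, 3.0]})
imputed_numeric = impute_mean(toy_numeric, "MEL_QTY")
print(imputed_numeric)
```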

8.1.2 Categorical Imputations
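
Mode imputation for a categorical feature could be sketched as below. The helper name impute_mode and the toy values are hypothetical, and OPERATOR_CD is used only as an example column; the source does not state which categorical columns need imputing.

```python
import pandas as pd


def impute_mode(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return a copy of df with nulls in `column` replaced by the most frequent value."""
    out = df.copy()
    modes = out[column].mode(dropna=True)
    if not modes.empty:
        # On ties, mode() returns the values sorted; the first one is used here.
        out[column] = out[column].fillna(modes.iloc[0])
    return out


# Toy data (hypothetical values, for illustration only)
toy_categorical = pd.DataFrame({"OPERATOR_CD": ["AA", None, "AA", "BB"]})
imputed_categorical = impute_mode(toy_categorical, "OPERATOR_CD")
print(imputed_categorical)
```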

8.2 Date Conversion
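
The date handling described under Type Conversions can be sketched as follows. This is an assumption-laden illustration: the helper name add_component_life_days and the toy dates are hypothetical, the date strings are assumed to be parseable by pandas, and unparseable entries are coerced to missing values rather than raising an error.

```python
import pandas as pd


def add_component_life_days(df: pd.DataFrame) -> pd.DataFrame:
    """Parse INSTALL_DT and REMOVAL_DT and derive COMPONENT_LIFE_DAYS from them."""
    out = df.copy()
    out["INSTALL_DT"] = pd.to_datetime(out["INSTALL_DT"], errors="coerce")
    out["REMOVAL_DT"] = pd.to_datetime(out["REMOVAL_DT"], errors="coerce")
    # Whole days between installation and removal
    out["COMPONENT_LIFE_DAYS"] = (out["REMOVAL_DT"] - out["INSTALL_DT"]).dt.days
    return out


# Toy data (hypothetical values, for illustration only)
toy_dates = pd.DataFrame({
    "INSTALL_DT": ["2020-01-01", "2020-03-01"],
    "REMOVAL_DT": ["2020-01-31", "2021-03-01"],
})
with_life = add_component_life_days(toy_dates)
print(with_life)
```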


8.3 Categorical Data Conversion

- Creating Dummies
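
Dummy creation for OPERATOR_CD and FLEET_CD could be sketched with pandas.get_dummies as below. The helper name add_dummies and the toy values are assumptions for illustration; dtype=int is chosen here so the dummy columns hold 0/1 rather than booleans.

```python
import pandas as pd


def add_dummies(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Replace each listed column with one 0/1 indicator column per distinct value."""
    return pd.get_dummies(df, columns=columns, dtype=int)


# Toy data (hypothetical values, for illustration only)
toy_codes = pd.DataFrame({
    "OPERATOR_CD": ["AA", "BB", "AA"],
    "FLEET_CD": [101, 102, 101],
    "MEL_QTY": [1.0, 2.0, 3.0],
})
encoded = add_dummies(toy_codes, ["OPERATOR_CD", "FLEET_CD"])
print(list(encoded.columns))
```

Columns not listed (MEL_QTY here) pass through unchanged, and the original code columns are replaced by their indicator columns.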


Rechecking types

Saving the processed file


Requirements.txt (libraries used in session)


